11 research outputs found

    Reliability-oriented resource management for High-Performance Computing

    Reliability is an increasingly pressing issue for High-Performance Computing systems, as failures are a threat to large-scale applications, for which even a single run may incur significant energy and billing costs. Currently, application developers need to address reliability explicitly, by integrating application-specific checkpoint/restore mechanisms. However, the application alone cannot exploit system knowledge, whereas system-wide resource management systems can. In this paper, we propose a reliability-oriented policy that can significantly increase component reliability by combining the exploitation of checkpoint/restore mechanisms with proactive resource management policies.

    Tutorial applications for Verification, Validation and Uncertainty Quantification using VECMA toolkit

    The VECMA toolkit (VECMAtk) enables automated Verification, Validation and Uncertainty Quantification (VVUQ) for complex applications that can be deployed on emerging exascale platforms and provides support for software applications for any domain of interest. The toolkit has four main components: EasyVVUQ for VVUQ workflows, FabSim3 for automation and tool integration, MUSCLE3 for coupling multiscale models, and QCG tools for executing application workflows on high performance computing (HPC) systems. A more recent addition to the VECMAtk is EasySurrogate for various types of surrogate methods. In this paper, we present five tutorials from different application domains that apply these VECMAtk components to perform uncertainty quantification analysis, use surrogate models, couple multiscale models and execute sensitivity analysis on HPC. This paper aims to provide hands-on experience for practitioners aiming to test and contrast the toolkit with their own applications.

    TEXTAROSSA: Towards EXtreme scale Technologies and Accelerators for euROhpc hw/Sw Supercomputing Applications for exascale

    To achieve high performance and high energy efficiency on near-future exascale computing systems, three key technology gaps need to be bridged. These gaps include: energy efficiency and thermal control; extreme computation efficiency via HW acceleration and new arithmetics; methods and tools for seamless integration of reconfigurable accelerators in heterogeneous HPC multi-node platforms. TEXTAROSSA aims at tackling these gaps through a co-design approach to heterogeneous HPC solutions, supported by the integration and extension of HW and SW IPs, programming models and tools derived from European research.

    Methods to Load Balance a GCR Pressure Solver Using a Stencil Framework on Multi- and Many-Core Architectures

    The recent advent of novel multi- and many-core architectures forces application programmers to deal with hardware-specific implementation details and to be familiar with software optimisation techniques to benefit from new high-performance computing machines. Extra care must be taken for communication-intensive algorithms, which may be a bottleneck for the forthcoming era of exascale computing. This paper aims to present a high-level stencil framework implemented for the EULerian or LAGrangian model (EULAG) that efficiently utilises multi- and many-core architectures. Only an efficient usage of both many-core processors (CPUs) and graphics processing units (GPUs) with the flexible data decomposition method can lead to the maximum performance that scales the communication-intensive Generalized Conjugate Residual (GCR) elliptic solver with preconditioner.

    Challenges in Deeply Heterogeneous High Performance Systems

    RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project, which span run-time management, heterogeneous computing architectures, HPC memory/interconnection infrastructures, thermal modelling, reliability, programming models, and timing analysis. For each of these areas, the paper describes the relevant state of the art as well as the specific actions that the project will take to effectively address the identified technological challenges.

    The RECIPE approach to challenges in deeply heterogeneous high performance systems

    RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches to maximize hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. We address it through hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of the application latency and hardware reliability obtained respectively through timing analysis and through modeling the thermal properties and mean-time-to-failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case. The activities described in this article received funding from the European Union's Horizon 2020 research and innovation programme under the FETHPC grant agreement no. 801137 RECIPE: REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems. Agosta, G.; Fornaciari, W.; Atienza, D.; Canal, R.; Cilardo, A.; Flich Cardo, J.; Hernández Luz, C.... (2020). The RECIPE approach to challenges in deeply heterogeneous high performance systems. Microprocessors and Microsystems. 77:1-13. https://doi.org/10.1016/j.micpro.2020.103185

    Challenges in deeply heterogeneous high performance systems

    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project, which span run-time management, heterogeneous computing architectures, HPC memory/interconnection infrastructures, thermal modelling, reliability, programming models, and timing analysis. For each of these areas, the paper describes the relevant state of the art as well as the specific actions that the project will take to effectively address the identified technological challenges. Peer Reviewed

    Towards EXtreme scale technologies and accelerators for euROhpc hw/Sw supercomputing applications for exascale: The TEXTAROSSA approach

    In the near future, Exascale systems will need to bridge three technology gaps to achieve high performance while remaining under tight power constraints: energy efficiency and thermal control; extreme computation efficiency via HW acceleration and new arithmetic; and methods and tools for seamless integration of reconfigurable accelerators in heterogeneous HPC multi-node platforms. TEXTAROSSA addresses these gaps through a co-design approach to heterogeneous HPC solutions, supported by the integration and extension of HW and SW IPs, programming models, and tools derived from European research.